Fighting Copycat Agents in Behavioral Cloning from Observation Histories

Neural Information Processing Systems

Imitation learning trains policies to map from input observations to the actions that an expert would choose. In this setting, distribution shift frequently exacerbates the effect of misattributing expert actions to nuisance correlates among the observed variables. We observe that a common instance of this causal confusion occurs in partially observed settings when expert actions are strongly correlated over time: the imitator learns to cheat by predicting the expert's previous action, rather than the next action. To combat this copycat problem, we propose an adversarial approach to learn a feature representation that removes excess information about the previous expert action nuisance correlate, while retaining the information necessary to predict the next action. In our experiments, our approach improves performance significantly across a variety of partially observed imitation learning tasks.
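The copycat shortcut described above can be seen in a tiny simulation (an illustrative sketch, not from the paper; the repeat probability of 0.9 and the binary action space are assumptions): when expert actions are strongly correlated over time, a predictor that simply echoes the previous action scores far above chance without consulting the observation at all.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000

# Hypothetical expert actions, strongly correlated over time:
# repeat the previous action with prob. 0.9, else pick a fresh binary action.
actions = np.empty(T, dtype=int)
actions[0] = rng.integers(2)
for t in range(1, T):
    actions[t] = actions[t - 1] if rng.random() < 0.9 else rng.integers(2)

# A "copycat" that echoes the previous action is correct with probability
# 0.9 + 0.1 * 0.5 = 0.95 in expectation -- far above the 0.5 chance level,
# despite using no state information whatsoever.
copycat_acc = np.mean(actions[1:] == actions[:-1])
print(f"copycat accuracy: {copycat_acc:.2f}")
```

This is exactly the nuisance correlate the adversarial feature representation is meant to remove: the previous action is highly predictive of the next one, so a naive imitator latches onto it.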


Regularized Behavior Cloning for Blocking the Leakage of Past Action Information

Neural Information Processing Systems

For partially observable environments, imitation learning with observation histories (ILOH) assumes that control-relevant information is sufficiently captured in the observation histories for imitating the expert actions. In the offline setting where the agent is required to learn to imitate without interaction with the environment, behavior cloning (BC) has been shown to be a simple yet effective method for imitation learning. However, when information about the actions executed in past timesteps leaks into the observation histories, ILOH via BC often ends up imitating its own past actions. In this paper, we address this catastrophic failure by proposing a principled regularization for BC, which we name Past Action Leakage Regularization (PALR). The main idea behind our approach is to leverage the classical notion of conditional independence to mitigate the leakage. We compare different instances of our framework with natural choices of conditional independence metric and its estimator. The result of our comparison advocates the use of a particular kernel-based estimator for the conditional independence metric. We conduct an extensive set of experiments on benchmark datasets in order to assess the effectiveness of our regularization method. The experimental results show that our method significantly outperforms prior related approaches, highlighting its potential to successfully imitate expert actions when the past action information leaks into the observation histories.
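The paper advocates a kernel-based estimator of a conditional independence metric. As a simplified stand-in, the sketch below implements the standard biased empirical HSIC (an *unconditional* kernel dependence measure) with Gaussian kernels; the kernel bandwidth, sample sizes, and data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def rbf_gram(Z, sigma=1.0):
    # Gaussian (RBF) kernel Gram matrix over the rows of Z.
    d2 = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def hsic(X, Y, sigma=1.0):
    # Biased empirical HSIC: tr(K H L H) / m^2, H = centering matrix.
    m = X.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(rbf_gram(X, sigma) @ H @ rbf_gram(Y, sigma) @ H) / m ** 2

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
Y_dep = X + 0.1 * rng.normal(size=(200, 1))  # strongly dependent on X
Y_ind = rng.normal(size=(200, 1))            # independent of X
print(hsic(X, Y_dep) > hsic(X, Y_ind))       # dependence yields a larger value
```

In a PALR-style training loop, a quantity of this kind (for the paper, its conditional counterpart) would be added to the BC loss as a penalty on the dependence between the learned features and the past action.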





Online Competitive Information Gathering for Partially Observable Trajectory Games

Krusniak, Mel, Xu, Hang, Palermo, Parker, Laine, Forrest

arXiv.org Artificial Intelligence

Game-theoretic agents must make plans that optimally gather information about their opponents. These problems are modeled by partially observable stochastic games (POSGs), but planning in fully continuous POSGs is intractable without heavy offline computation or assumptions on the order of belief maintained by each player. We formulate a finite history/horizon refinement of POSGs which admits competitive information gathering behavior in trajectory space, and through a series of approximations, we present an online method for computing rational trajectory plans in these games which leverages particle-based estimations of the joint state space and performs stochastic gradient play. We also provide the necessary adjustments required to deploy this method on individual agents. The method is tested in continuous pursuit-evasion and warehouse-pickup scenarios (alongside extensions to $N > 2$ players and to more complex environments with visual and physical obstacles), demonstrating evidence of active information gathering and outperforming passive competitors.
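The method above leans on particle-based estimation of the joint state. As a generic illustration of that ingredient only (not the paper's algorithm), here is a minimal bootstrap particle filter step for a 1-D hidden state; the motion and observation noise levels are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter_step(particles, obs, motion_std=0.1, obs_std=0.2):
    # Propagate particles through a random-walk motion model,
    # weight them by the Gaussian observation likelihood, then resample.
    particles = particles + rng.normal(0, motion_std, size=particles.shape)
    w = np.exp(-0.5 * ((obs - particles) / obs_std) ** 2)
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# Track a hidden state near 1.0 from noisy observations.
particles = rng.uniform(-5, 5, size=1000)
for _ in range(30):
    particles = particle_filter_step(particles, obs=1.0 + rng.normal(0, 0.2))
print(f"posterior mean estimate: {particles.mean():.2f}")
```

In the game setting, each player would maintain such a particle belief over the *joint* state (its own and its opponents') and differentiate through plans against it.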


Observational Learning with a Budget

Wu, Shuo, Poojary, Pawan, Berry, Randall

arXiv.org Artificial Intelligence

We consider a model of Bayesian observational learning in which a sequence of agents receives a private signal about an underlying binary state of the world. Each agent makes a decision based on its own signal and its observations of previous agents. A central planner seeks to improve the accuracy of these signals by allocating a limited budget to enhance signal quality across agents. We formulate and analyze the budget allocation problem and propose two optimal allocation strategies. At least one of these strategies is shown to maximize the probability of achieving a correct information cascade.

INTRODUCTION

Consider an item, which could be of either "good" or "bad" quality, up for sale in a market where agents arrive sequentially and decide whether to buy it, with each choice serving as a recommendation for later agents. While the quality of the item is unknown to the agents, every agent has its own prior knowledge of the item's quality in the form of a private belief. Each agent then makes a payoff-optimal decision based on its own prior knowledge and on the observed choices of its predecessors. Such models of "observational learning" were first studied by [1]-[3] under a Bayesian learning framework, wherein each agent's prior knowledge takes the form of a privately observed signal about the payoff-relevant state of the world (here, the item's quality), generated from a commonly known probability distribution. A salient feature of such models is the emergence of information cascades, or herding: at some point, it is optimal for an agent to ignore its own private signal and follow the actions of past agents. Subsequent agents then follow suit due to their homogeneity.
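The herding dynamic described in the introduction can be sketched with the classic counting rule from the early observational-learning literature (a standard textbook construction, not this paper's budget-allocation model; the signal accuracy p = 0.7 and agent count are illustrative): agents follow their own signal until the predecessors' action tally leads by two, after which the lead is decisive and the private signal is ignored.

```python
import numpy as np

def run_cascade(rng, n_agents=50, p=0.7, state=1):
    # Counting rule: before a cascade, each agent follows its own private
    # signal (correct w.p. p). Once the action tally leads by >= 2, Bayesian
    # updating makes the lead decisive and the signal is ignored: a cascade.
    diff, actions = 0, []  # diff = (# of action-1) - (# of action-0) so far
    for _ in range(n_agents):
        s = state if rng.random() < p else 1 - state
        score = diff + (2 * s - 1)
        a = 1 if score > 0 else (0 if score < 0 else s)
        actions.append(a)
        diff += 2 * a - 1
    return actions

actions = run_cascade(np.random.default_rng(0))
# Once a cascade starts, every subsequent agent acts identically:
print(len(set(actions[-10:])) == 1)
```

With p = 0.7, most runs end in a correct cascade, but a nontrivial fraction herd on the wrong action, which is precisely the failure mode the planner's budget allocation aims to reduce.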